143 research outputs found

    Hybrid LSH: Faster Near Neighbors Reporting in High-dimensional Space

    We study the r-near neighbors reporting problem (r-NN), i.e., reporting all points in a high-dimensional point set S that lie within a radius r of a given query point q. Our approach builds upon the locality-sensitive hashing (LSH) framework due to its appealing asymptotic sublinear query time for near neighbor search problems in high-dimensional space. A bottleneck of the traditional LSH scheme for solving r-NN is that its performance is sensitive to data-dependent and query-dependent parameters. On datasets whose data distributions have diverse local density patterns, LSH with inappropriate tuning parameters can sometimes be outperformed by a simple linear search. In this paper, we introduce a hybrid search strategy that combines LSH-based search and linear search for r-NN in high-dimensional space. By integrating an auxiliary data structure into LSH hash tables, we can efficiently estimate the computational cost of LSH-based search for a given query regardless of the data distribution. This means that we are able to choose the appropriate search strategy between LSH-based search and linear search to achieve better performance. Moreover, the integrated data structure is time efficient and fits well with many recent state-of-the-art LSH-based approaches. Our experiments on real-world datasets show that the hybrid search approach outperforms (or is comparable to) both LSH-based search and linear search for a wide range of search radii and data distributions in high-dimensional space. Comment: Accepted as a short paper in EDBT 201
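
The per-query choice between LSH-based search and linear search described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the cost weights `alpha` and `beta`, and the exact distinct count (standing in for the auxiliary cost-estimation structure, which the abstract does not specify) are all assumptions.

```python
from math import dist


def estimate_lsh_cost(buckets, alpha=1.0, beta=10.0):
    """Estimate the cost of LSH-based search for one query.

    buckets: one list of point indices per hash table (the buckets the
    query falls into). alpha is an assumed cost per bucket entry read,
    beta an assumed cost per distance computation.
    """
    collisions = sum(len(b) for b in buckets)
    # Exact distinct count; a stand-in for a cheap cardinality estimator.
    distinct = len(set().union(*buckets))
    return alpha * collisions + beta * distinct


def hybrid_r_nn(points, query, r, buckets, alpha=1.0, beta=10.0):
    """Report points within radius r, picking the cheaper strategy."""
    lsh_cost = estimate_lsh_cost(buckets, alpha, beta)
    linear_cost = beta * len(points)  # one distance computation per point
    if lsh_cost < linear_cost:
        strategy, candidates = "lsh", set().union(*buckets)
    else:
        strategy, candidates = "linear", range(len(points))
    result = sorted(i for i in candidates if dist(points[i], query) <= r)
    return strategy, result


points = [(0, 0), (1, 0), (5, 5), (0.5, 0.5)]
sparse = [[0, 1], [0, 3]]              # few collisions: LSH is cheaper
dense = [[0, 1, 2, 3], [0, 1, 2, 3]]   # every point collides: scan instead
print(hybrid_r_nn(points, (0, 0), 1.0, sparse))
print(hybrid_r_nn(points, (0, 0), 1.0, dense))
```

When the query lands in dense buckets, deduplicating and checking the candidates costs more than a plain scan, so the sketch falls back to linear search. In the LSH branch only colliding points are checked, so true neighbors that never collide with the query can be missed.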

    Press freedom and reporting on the government in Myanmar

    Professional project report submitted in partial fulfillment of the requirements for the degree of Master of Arts in Journalism from the School of Journalism, University of Missouri--Columbia. This project investigates the state of press freedom in Myanmar by comparing reporting on the government in Myanmar before and after the government lifted central censorship on print media in August 2012. Eleven semi-structured interviews were conducted. Subjects are twelve experienced journalists who have been covering Myanmar for years. The results show a significant change in working conditions as well as in attitudes towards reporting in Myanmar. While skepticism about a totally free press remains, many journalists are optimistic about the future. Includes bibliographic references

    A Near-linear Time Approximation Algorithm for Angle-based Outlier Detection in High-dimensional Data

    Outlier mining in d-dimensional point sets is a fundamental and well-studied data mining task due to its variety of applications. Most such applications arise in high-dimensional domains. A bottleneck of existing approaches is that implicit or explicit assessments of concepts of distance or nearest neighbor deteriorate in high-dimensional data. Following up on the work of Kriegel et al. (KDD ’08), we investigate the use of the angle-based outlier factor in mining high-dimensional outliers. While their algorithm runs in cubic time (with a quadratic time heuristic), we propose a novel random projection-based technique that is able to estimate the angle-based outlier factor for all data points in time near-linear in the size of the data. Our approach is also suitable for parallel environments, where it can achieve a parallel speedup. We provide a theoretical analysis of the quality of approximation to guarantee the reliability of our estimation algorithm. Empirical experiments on synthetic and real-world data sets demonstrate that our approach is efficient and scalable to very large high-dimensional data sets
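
For intuition about the quantity being estimated, the sketch below computes a simplified angle-based score by brute force: the variance of the angles that a point forms with every pair of other points. It is an assumed, unweighted simplification (the factor of Kriegel et al. also weights by distance), and it is the naive baseline rather than the random-projection estimator this abstract proposes. The function name is hypothetical.

```python
import math
from itertools import combinations


def angle_variance(points, i):
    """Variance of the angles at points[i] over all pairs of other points.

    Needs at least three distinct points. Costs O(n^2) per point, hence
    O(n^3) for all points, matching the cubic baseline.
    """
    p = points[i]
    others = [q for j, q in enumerate(points) if j != i]
    angles = []
    for a, b in combinations(others, 2):
        u = [x - y for x, y in zip(a, p)]
        v = [x - y for x, y in zip(b, p)]
        nu, nv = math.hypot(*u), math.hypot(*v)
        if nu == 0 or nv == 0:
            continue  # skip duplicates of p; the angle is undefined
        cos = sum(x * y for x, y in zip(u, v)) / (nu * nv)
        # Clamp against floating-point drift before taking acos.
        angles.append(math.acos(max(-1.0, min(1.0, cos))))
    mean = sum(angles) / len(angles)
    return sum((t - mean) ** 2 for t in angles) / len(angles)


# A 3x3 grid plus one far-away point (index 9).
pts = [(x, y) for x in range(3) for y in range(3)] + [(20, 20)]
print(angle_variance(pts, 4))  # grid center: angles spread widely
print(angle_variance(pts, 9))  # far point: all angles are narrow
```

A point far from the data sees every other point within a narrow cone, so its angle variance is small. A low score therefore indicates a likely outlier.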

    Revisiting Wedge Sampling for Budgeted Maximum Inner Product Search

    Top-k maximum inner product search (MIPS) is a central task in many machine learning applications. This paper extends top-k MIPS with a budgeted setting that asks for the best approximate top-k MIPS given a limit of B computational operations. We investigate recent advanced sampling algorithms, including wedge and diamond sampling, to solve it. Though the design of these sampling schemes naturally supports budgeted top-k MIPS, they suffer from the linear cost of scanning all data points to retrieve top-k results and from performance degradation when handling negative inputs. This paper makes two main contributions. First, we show that diamond sampling is essentially a combination of wedge sampling and basic sampling for top-k MIPS. Our theoretical analysis and empirical evaluation show that wedge is competitive with (often superior to) diamond on approximating top-k MIPS regarding both efficiency and accuracy. Second, we propose a series of algorithmic engineering techniques to deploy wedge sampling on budgeted top-k MIPS. Our novel deterministic wedge-based algorithm runs significantly faster than the state-of-the-art methods for budgeted and exact top-k MIPS while maintaining a top-5 precision of at least 80% on standard recommender system data sets. Comment: ECML-PKDD 202
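
Basic randomized wedge sampling, the starting point this abstract revisits, can be sketched as follows. The sketch assumes non-negative data and query values, treats the budget as a number of samples rather than the operation count B, and omits any final re-ranking by exact inner products. It is not the deterministic algorithm the paper proposes, and the function name is hypothetical.

```python
import random
from collections import Counter


def wedge_sample_topk(X, q, k, budget, seed=0):
    """Approximate top-k MIPS by wedge sampling (non-negative inputs).

    Each sample picks a dimension j with probability proportional to
    q[j] * (column sum of j), then a point i with probability
    proportional to X[i][j]. Point i is therefore drawn with
    probability proportional to its inner product with q.
    """
    rng = random.Random(seed)
    n, d = len(X), len(q)
    col_sums = [sum(X[i][j] for i in range(n)) for j in range(d)]
    dim_weights = [q[j] * col_sums[j] for j in range(d)]
    counts = Counter()
    for j in rng.choices(range(d), weights=dim_weights, k=budget):
        column = [X[i][j] for i in range(n)]
        i = rng.choices(range(n), weights=column)[0]
        counts[i] += 1
    # Points sampled most often are the top-k candidates.
    return [i for i, _ in counts.most_common(k)]


X = [[10.0, 10.0], [0.1, 0.1], [0.1, 0.2]]
q = [1.0, 1.0]
print(wedge_sample_topk(X, q, k=1, budget=500))
```

Recomputing the column weights inside the loop keeps the sketch short. A practical version would precompute per-dimension sampling tables once, since that cost is independent of the query.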

    Kinetics and mechanism of various iron transformations in natural waters at circumneutral pH.

    In this thesis, the implementation and results of studies into the effect of pH on the kinetics of various iron transformations in natural waters are described. Specific studies include i) the oxidation of Fe(II) in the absence and presence of both model and natural organic ligands, ii) the complexation of Fe(III) by model organic compounds, and iii) the precipitation of Fe(III), through the use of both laboratory investigations of iron species and kinetic modeling. In the absence of organic ligands, oxidation of nanomolar concentrations of Fe(II) over the pH range 6.0–8.0 is predominantly controlled by the reaction of Fe(II) with oxygen and with superoxide, while the disproportionation of superoxide appears to be negligible. Oxidation of Fe(II) by hydrogen peroxide, back reduction of Fe(III) by superoxide and precipitation of Fe(III) have been shown to exert some influence at various stages of the oxidation at different pH and initial Fe(II) concentrations. In the presence of organic ligands, different effects on the Fe(II) oxidation kinetics are shown with different organic ligands, their initial concentrations and varying pH. A detailed kinetic model is developed and shown to adequately describe the kinetics of Fe(II) oxidation in the absence and presence of various ligands over a range of concentrations and pH. The applicability of previous oxidation models to describe the experimental data is assessed. Rate constants for the formation of Fe(III) complexes with a range of model organic compounds over the pH range 6.0–9.5 are determined. Variation of rate constants for Fe(III) complexation by desferrioxamine B and ethylenediaminetetraacetate with varying pH is explained by an outer-sphere complexation model. The significant variation in rate constants of Fe(III) complexation by salicylate, 5-sulfosalicylate, citrate and 3,4-dihydroxylbenzoate with varying pH is possibly due to the presence of different complexes at different pH.
    The results of this study demonstrate that organic ligands from different sources may influence the speciation of iron in vastly different ways. The kinetics of Fe(III) precipitation are investigated in bicarbonate solutions over the pH range 6.0–9.5. The rate of precipitation varies by nearly two orders of magnitude, with a maximum rate constant at a pH of around 8.0. The results of the study support the existence of the dissolved neutral species Fe(OH)₃⁰ and suggest that it is the dominant precursor in Fe(III) polymerization and subsequent precipitation at circumneutral pH. Variation in the precipitation rate constant over the pH range considered is consistent with a mechanism in which the kinetics of iron precipitation are controlled by rates of water exchange in dissolved iron hydrolysis species

    On the Power of Randomization in Big Data Analytics

    I/O-Efficient Similarity Join

    We present an I/O-efficient algorithm for computing similarity joins based on locality-sensitive hashing (LSH). In contrast to the filtering methods commonly suggested, our method has provable sub-quadratic dependency on the data size. Further, in contrast to straightforward implementations of known LSH-based algorithms on external memory, our approach is able to take significant advantage of the available internal memory: whereas the time complexity of classical algorithms includes a factor of N^rho, where rho is a parameter of the LSH used, the I/O complexity of our algorithm merely includes a factor (N/M)^rho, where N is the data size and M is the size of internal memory. Our algorithm is randomized and outputs the correct result with high probability. It is a simple, recursive, cache-oblivious procedure, and we believe that it will also be useful in other computational settings such as parallel computation
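
To give a feel for the gap between the two factors named in this abstract, the worked example below plugs in illustrative values. The values N = 10^9, M = 10^6 and rho = 1/2 are assumptions chosen only for the arithmetic and do not come from the paper.

```latex
\begin{align*}
N^{\rho}     &= \left(10^{9}\right)^{1/2} = 10^{4.5} \approx 3.16 \times 10^{4}, \\
(N/M)^{\rho} &= \left(10^{9} / 10^{6}\right)^{1/2} = 10^{1.5} \approx 31.6, \\
\frac{N^{\rho}}{(N/M)^{\rho}} &= M^{\rho} = \left(10^{6}\right)^{1/2} = 10^{3}.
\end{align*}
```

Under these assumed values, the factor shrinks by M^rho, a thousandfold reduction. This illustrates how a larger internal memory directly reduces the factor in the I/O bound.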

    Effects of ASE noise and dispersion chromatic on performance of DWDM networks using distributed Raman amplifiers

    We investigate the effects of amplified spontaneous emission (ASE) noise, noise figure (NF) and chromatic dispersion on the performance of DWDM networks using distributed optical fiber Raman amplifiers (DRAs) in two different pump configurations, i.e., forward and backward pumping. We find that the pumping configuration, ASE noise, and dispersion play an important role in improving network performance by reducing the noise figure and bit error rate (BER) of the system. Simulation results show that the forward pumping configuration yields the lowest bit error rate and noise figure. Moreover, we have also compared the ASE noise powers from the simulation with those from the experiment, and they match